library(readr)
brands <- read.csv(file="BelkinElagoComplete.csv", header = TRUE, sep = ";")
head(brands, 10)
## salary age elevel car zipcode credit brand
## 1 123476.6 23 4 1 0 779.56 Elago
## 2 120274.6 22 4 1 2 784.70 Elago
## 3 121735.5 26 4 1 1 749.35 Elago
## 4 138276.8 23 4 1 0 743.85 Elago
## 5 126869.2 23 4 1 0 759.03 Elago
## 6 130595.1 22 4 1 7 774.13 Elago
## 7 121358.0 80 4 1 2 732.03 Belkin
## 8 127457.4 71 4 1 5 744.29 Belkin
## 9 137670.5 75 4 1 0 815.00 Belkin
## 10 120088.7 59 4 1 7 740.48 Belkin
library(funModeling)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.3 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.1 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ purrr 0.3.2 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
library(Hmisc)
Number of observations (rows) and variables, and a head of the first cases
glimpse(brands)
## Observations: 10,000
## Variables: 7
## $ salary <dbl> 123476.6, 120274.6, 121735.5, 138276.8, 126869.2, 130595…
## $ age <int> 23, 22, 26, 23, 23, 22, 80, 71, 75, 59, 73, 59, 77, 58, …
## $ elevel <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ car <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ zipcode <int> 0, 2, 1, 0, 0, 7, 2, 5, 0, 7, 6, 6, 2, 1, 5, 3, 5, 7, 8,…
## $ credit <dbl> 779.56, 784.70, 749.35, 743.85, 759.03, 774.13, 732.03, …
## $ brand <fct> Elago, Elago, Elago, Elago, Elago, Elago, Belkin, Belkin…
Checking missing values, zeros, data type, and unique values Probably one of the first steps, when we get a new dataset to analyze, is to know if there are missing values (NA in R) and the data type.
The df_status function coming in funModeling can help us by showing these numbers in relative and percentage values. It also retrieves the infinite and zeros statistics.
df_status(brands)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 salary 0 0.00 0 0 0 0 numeric 9757
## 2 age 0 0.00 0 0 0 0 integer 61
## 3 elevel 0 0.00 0 0 0 0 integer 4
## 4 car 0 0.00 0 0 0 0 integer 20
## 5 zipcode 1097 10.97 0 0 0 0 integer 9
## 6 credit 0 0.00 0 0 0 0 numeric 8651
## 7 brand 0 0.00 0 0 0 0 factor 2
We see that there are 1097 examples or 11% of the variable zipcode with the zero value. As we know that it is regular zip code value it is fine and we want to transform this variable to categorical. We see that the variable elevel should also be transformed to categorical. Let’s do that now:
brands$elevel <- as.factor(brands$elevel)
brands$zipcode <- as.factor(brands$zipcode)
df_status(brands)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 salary 0 0.00 0 0 0 0 numeric 9757
## 2 age 0 0.00 0 0 0 0 integer 61
## 3 elevel 0 0.00 0 0 0 0 factor 4
## 4 car 0 0.00 0 0 0 0 integer 20
## 5 zipcode 1097 10.97 0 0 0 0 factor 9
## 6 credit 0 0.00 0 0 0 0 numeric 8651
## 7 brand 0 0.00 0 0 0 0 factor 2
Now that we have converted our variables to factors, we can run the function freq() that runs for all factor variables. We get some nice plots and tables as the result.
freq(brands)
## elevel frequency percentage cumulative_perc
## 1 2 4404 44.04 44.04
## 2 3 3497 34.97 79.01
## 3 1 1435 14.35 93.36
## 4 4 664 6.64 100.00
## zipcode frequency percentage cumulative_perc
## 1 6 1167 11.67 11.67
## 2 8 1144 11.44 23.11
## 3 2 1125 11.25 34.36
## 4 5 1113 11.13 45.49
## 5 4 1103 11.03 56.52
## 6 0 1097 10.97 67.49
## 7 3 1094 10.94 78.43
## 8 7 1091 10.91 89.34
## 9 1 1066 10.66 100.00
## brand frequency percentage cumulative_perc
## 1 Elago 5348 53.48 53.48
## 2 Belkin 4652 46.52 100.00
## [1] "Variables processed: elevel, zipcode, brand"
We will see: plot_num and profiling_num. Both run automatically for all numerical/integer variables:
plot_num(brands)
profiling_num(brands)
## variable mean std_dev variation_coef p_01 p_05
## 1 salary 84897.2509 37709.806791 0.4441817 20000.0000 25921.608
## 2 age 49.8115 17.596322 0.3532582 20.0000 22.000
## 3 car 10.4728 6.023661 0.5751719 1.0000 1.000
## 4 credit 632.0648 91.371581 0.1445605 445.4995 482.457
## p_25 p_50 p_75 p_95 p_99 skewness
## 1 52109.2827 84968.94 117168.2605 143297.0010 150000.000 -0.006945249
## 2 35.0000 50.00 65.0000 77.0000 80.000 0.008270035
## 3 5.0000 10.00 16.0000 20.0000 20.000 0.058502642
## 4 563.3825 632.07 701.2975 783.5315 815.005 0.002483984
## kurtosis iqr range_98 range_80
## 1 1.814148 65058.978 [20000, 150000] [32480.123939, 137027.67826]
## 2 1.800191 30.000 [20, 80] [26, 74]
## 3 1.718976 11.000 [1, 20] [3, 19]
## 4 2.235649 137.915 [445.4995, 815.005] [508.999, 757.386]
describe from Hmisc package
library(Hmisc)
describe(brands)
## brands
##
## 7 Variables 10000 Observations
## ---------------------------------------------------------------------------
## salary
## n missing distinct Info Mean Gmd .05 .10
## 10000 0 9757 1 84897 43544 25922 32480
## .25 .50 .75 .90 .95
## 52109 84969 117168 137028 143297
##
## lowest : 20000.00 20105.59 20140.31 20141.14 20218.82
## highest: 149919.44 149964.88 149973.46 149994.21 150000.00
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 10000 0 61 1 49.81 20.32 22 26
## .25 .50 .75 .90 .95
## 35 50 65 74 77
##
## lowest : 20 21 22 23 24, highest: 76 77 78 79 80
## ---------------------------------------------------------------------------
## elevel
## n missing distinct
## 10000 0 4
##
## Value 1 2 3 4
## Frequency 1435 4404 3497 664
## Proportion 0.144 0.440 0.350 0.066
## ---------------------------------------------------------------------------
## car
## n missing distinct Info Mean Gmd .05 .10
## 10000 0 20 0.997 10.47 6.937 1 3
## .25 .50 .75 .90 .95
## 5 10 16 19 20
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 634 280 841 420 445 395 722 749 264 429
## Proportion 0.063 0.028 0.084 0.042 0.044 0.040 0.072 0.075 0.026 0.043
##
## Value 11 12 13 14 15 16 17 18 19 20
## Frequency 312 423 443 423 636 409 420 533 423 799
## Proportion 0.031 0.042 0.044 0.042 0.064 0.041 0.042 0.053 0.042 0.080
## ---------------------------------------------------------------------------
## zipcode
## n missing distinct
## 10000 0 9
##
## Value 0 1 2 3 4 5 6 7 8
## Frequency 1097 1066 1125 1094 1103 1113 1167 1091 1144
## Proportion 0.110 0.107 0.112 0.109 0.110 0.111 0.117 0.109 0.114
## ---------------------------------------------------------------------------
## credit
## n missing distinct Info Mean Gmd .05 .10
## 10000 0 8651 1 632.1 104.9 482.5 509.0
## .25 .50 .75 .90 .95
## 563.4 632.1 701.3 757.4 783.5
##
## lowest : 416.59 419.73 421.85 422.08 422.59, highest: 846.38 847.65 847.73 848.56 849.00
## ---------------------------------------------------------------------------
## brand
## n missing distinct
## 10000 0 2
##
## Value Belkin Elago
## Frequency 4652 5348
## Proportion 0.465 0.535
## ---------------------------------------------------------------------------
We should pay special attention to variable car. It is just not realistic that our customers have on average 10 cars or that the most of the customers have 18 cars or more. It would be more realistic that we are dealing here with codes for car brands, so maybe we should convert the variable cars to factor.
brands$car <- as.factor(brands$car)
freq(brands$car)
## var frequency percentage cumulative_perc
## 1 3 841 8.41 8.41
## 2 20 799 7.99 16.40
## 3 8 749 7.49 23.89
## 4 7 722 7.22 31.11
## 5 15 636 6.36 37.47
## 6 1 634 6.34 43.81
## 7 18 533 5.33 49.14
## 8 5 445 4.45 53.59
## 9 13 443 4.43 58.02
## 10 10 429 4.29 62.31
## 11 12 423 4.23 66.54
## 12 14 423 4.23 70.77
## 13 19 423 4.23 75.00
## 14 4 420 4.20 79.20
## 15 17 420 4.20 83.40
## 16 16 409 4.09 87.49
## 17 6 395 3.95 91.44
## 18 11 312 3.12 94.56
## 19 2 280 2.80 97.36
## 20 9 264 2.64 100.00
library(ggplot2)
str(brands)
## 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 123477 120275 121736 138277 126869 ...
## $ age : int 23 22 26 23 23 22 80 71 75 59 ...
## $ elevel : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ zipcode: Factor w/ 9 levels "0","1","2","3",..: 1 3 2 1 1 8 3 6 1 8 ...
## $ credit : num 780 785 749 744 759 ...
## $ brand : Factor w/ 2 levels "Belkin","Elago": 2 2 2 2 2 2 1 1 1 1 ...
Create histograms for numerical variables
salary_hist <- ggplot(brands, aes(x=brands$salary)) + geom_histogram(bins=10) + theme_classic()
age_hist <- ggplot(brands, aes(x=brands$age)) + geom_histogram(bins=12) + theme_classic()
credit_hist <- ggplot(brands, aes(x=brands$credit)) + geom_histogram(bins=10) + theme_classic()
salary_hist
age_hist
credit_hist
Create bar charts for categorical variables
elevel_bar <- ggplot(brands, aes(x = elevel)) + geom_bar()
car_bar <- ggplot(brands, aes(x = car)) + geom_bar()
zip_bar <- ggplot(brands, aes(x = zipcode)) + geom_bar()
brand_bar <- ggplot(brands, aes(x = brand)) + geom_bar()
elevel_bar
car_bar
zip_bar
brand_bar
Relations between the variables
Let us first see the relations between education level and salary_hist:
elevel_salary <- ggplot(brands, aes(x = elevel, y = salary)) +
geom_boxplot()
elevel_salary
There is linear relationship between salary and education level. The lowest educated customers earn on average around 40000 and the highest educated almost 140000.
We can also plot variable brand by setting the color parameter:
elevel_salary_brand <- ggplot(brands, aes(x = elevel, y = salary, color = brand)) +
geom_boxplot()
elevel_salary_brand
brand_age <- ggplot(brands, aes(x = brand, y = age)) +
geom_boxplot()
brand_age
brand_salary <-ggplot(brands, aes(x = brand, y = salary)) +
geom_boxplot()
brand_salary
brands %>%
ggplot(aes(x = salary, fill = elevel)) +
geom_histogram(bins = 25)
brands %>%
ggplot(aes(x = elevel, fill = brand)) +
geom_bar()
brands %>%
ggplot(aes(x = elevel, fill = brand)) +
geom_bar(position= "fill")
brands %>%
ggplot( aes(x = elevel, y = brand)) +
geom_jitter() + coord_flip()
brands %>%
ggplot(aes(x=age, y = elevel, color=brand)) + geom_jitter() +coord_flip()
ggplot(brands, aes(salary, age)) + geom_point() + theme_classic() + geom_point(color = ifelse(brands$brand == "Belkin", 'red', 'black'))
ggplot(brands, aes(salary, age)) + geom_point() + theme_classic() + geom_point(color = ifelse(brands$brand == "Elago", 'purple', 'black'))
##Correlation matrix and variable importance
Here we want to check correlation between our variables. First we create a table and a plot with correlations to our target brand
correlation_table(data=brands, target="brand")
## Variable brand
## 1 brand 1.00
## 2 salary -0.02
## 3 credit -0.02
## 4 age -0.35
variable_importance <- var_rank_info(data=brands, target="brand")
ggplot(variable_importance, aes(x = reorder(var, gr), y = gr, fill = var)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_bw() +
xlab("") +
ylab("Variable Importance
(based on Information Gain)"
) +
guides(fill = FALSE)
* en entropy measured in bits * mi mutual information * ig information gain * gr gain ratio
We see that the most important variable for predicting brand would be salary followed with credit, and then age and elevel. We want to check if salary and credit are highly corelated.
cor(brands$salary, brands$credit)
## [1] 0.8700001
It seems that they are highly corelated. We will have to omit credit variable from our model.
Let us continue with the funModeling library and make some cross plots. This plot intent to show in real scenarios if a variable is or not important, making a visual summary of it, (by grouping numerical variables into bins/groups).
brands$salary_discretized =
equal_freq(var=brands$salary, n_bins = 8)
summary(brands$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 52109 84969 84897 117168 150000
cross_plot(brands, input="salary_discretized", target="brand", auto_binning = FALSE)
cross_plot(brands, input="age", target="brand")
## Plotting transformed variable 'age' with 'equal_freq', (too many values). Disable with 'auto_binning=FALSE'
cross_plot(brands, input="elevel", target="brand")